Abstract

We examine the possibility that recent promising results in automatic
caption generation are due primarily to language models. By varying
image representation quality produced by a convolutional neural
network, we find that a state-of-the-art neural captioning algorithm is
able to produce quality captions even when provided with surprisingly
poor image representations. We replicate this result in a new,
fine-grained, transfer learned captioning domain, consisting of 66K
recipe image/title pairs. We also provide some experiments regarding
the appropriateness of datasets for automatic captioning, and find that
having multiple captions per image is beneficial, but not an absolute
requirement.

1 Introduction 

Describing the content of an image is an easy task for humans, but,
until recently, had been difficult or impossible for computers. Recent
work in computer vision has addressed this task of automatically
generating the caption of an input image with promising results
(Farhadi et al., 2010; Kulkarni et al., 2013; Ordonez et al., 2011;
Karpathy and Li, 2014; Mao et al., 2014; Vinyals et al., 2014; Kiros
et al., 2014; Donahue et al., 2014; Fang et al., 2014). Several
state-of-the-art approaches couple a pre-trained deep convolutional
neural network (CNN) for image representation with a recurrent neural
network (RNN) to generate captions that describe image content.

We consider the possibility that the generation of these captions,
however, is not heavily reliant upon the image representation input.
For instance, if one were to train an RNN directly on image captions,
one could learn a fair amount about the general language of image
captions. Sutskever et al. (2011) demonstrate that RNNs are capable of
producing diverse and surprisingly readable sentences, given a short
starting sequence of seed words. Furthermore, non-neural memoization
techniques like those proposed by Wood et al. (2009) and Gasthaus et
al. (2010) are capable of producing very convincing language models for
particular domains.

While it is clear that existing algorithms do discriminate based on
image inputs, it is still unclear if the apparently highly specific
generated captions are primarily a result of language modeling rather
than image modeling. If it could be determined that either image
modeling or language modeling is acting as the bottleneck in this
multimodal setting, research efforts could be directed appropriately.

To examine the relative multimodal modeling capacities of existing
neural captioning algorithms, we execute a series of experiments where
we vary image representation quality produced from a fixed CNN, and
examine how the output captions are affected.

For two existing datasets and a new domain we analyze here, our results
suggest that caption quality does not scale well with increased
classification accuracy of a fixed CNN. In fact, as the
testing/validation accuracy of a CNN with fixed architecture increases,
all seven caption evaluation metrics we consider appear to saturate at
surprisingly low classification accuracies. While this does not prove
that better image modeling algorithms could not produce better
captions, it appears that many apparently fine-grained aspects of
generated natural language are the result of surprisingly
coarse-grained visual distinctions.

For a fixed vision model, our results indicate that there is likely
little room for caption improvement via gathering more training images
alone. We further postulate that progress could be made most quickly
through the development of language modeling techniques that take
better advantage of existing image representations. In particular,
coupling our results with independent but consistent observations made
by Karpathy and Li (2014) and Vinyals et al. (2014) regarding model
modifications that lead to overfitting, it is very likely that
overfitting language models to image features is still a significant
problem for many caption generation algorithms. Our analysis highlights
what we believe to be an important question for these types of
algorithms going forward: if better image representations contain
useful, fine-grained information, is it possible to take advantage of
that information without overfitting?

To supplement our analysis of image representations, we consider a new
caption generating task: generating recipe titles based on images of
food. The motivation for this new task results from the intuition that
image representations might matter more in visually fine-grained
domains, where algorithms must be able to discriminate between minute
changes in the input images. We collect a dataset consisting of images
of food coupled with recipe titles (e.g. “thai chicken curry”) from
Yummly.com for this purpose. When compared to captioning the
coarse-grained ImageNet domain, the specificity of our food dataset
calls for more subtle visual discrimination.

Instead of learning a food image CNN from scratch to derive
representations, we apply transfer learning on a dataset of 101K food
images. Using this approach, we significantly surpass current
state-of-the-art performance for a classification task on this dataset,
despite using a somewhat outdated deep architecture. We further
demonstrate that this transfer learning process does indeed improve
food captioning, though we observe a similar “flattening” of all
linguistic evaluation metrics, after a point.

2 Related Work 

2.1 Automatic Captioning 

The model we choose to analyze in detail is the “Neural Image
Captioning” (NIC) model detailed by Vinyals et al. (2014), though we
believe the experiments we address here are relevant to researchers
working on distinct but related models. In a similar fashion to Donahue
et al. (2014) and Karpathy and Li (2014), NIC feeds a
pre-classification representation of images produced by an architecture
like GoogLeNet (Szegedy et al., 2014) or AlexNet (Krizhevsky et al.,
2012) to an LSTM recurrent neural network (Hochreiter and Schmidhuber,
1997) for language generation. The RNN weights are usually trained on
datasets consisting of pairs of images and several corresponding
human-generated annotations, such as Flickr8k (Hodosh et al., 2013),
Flickr30k (Young et al., 2014), or Microsoft COCO (Lin et al., 2014).
The CNN is often pre-trained on a very large set of images such as
ImageNet (Deng et al., 2009) and held fixed while the RNN is trained.
For many existing captioning datasets, ImageNet is a convenient
starting point, presumably because images in most modern captioning
datasets are of similar objects.

More complicated caption generation models have also demonstrated
success on several datasets. To the knowledge of the authors, Fang et
al. (2014) hold the current best result (in terms of BLEU-4) on the
MSCOCO official captioning test set, though Vinyals et al. (2014)
reportedly outperform Fang et al. on 2/5 evaluation metrics detailed on
the MSCOCO captioning leaderboard.^1 The pipeline of Fang et al.
involves training a language model directly on captions and a
discretized image representation consisting of a likely set of objects
in that image. Switching from a fine-tuned AlexNet (Krizhevsky et al.,
2012) to a fine-tuned VGG-net (Simonyan and Zisserman, 2014) improved
BLEU-4 by 2.4 points, and METEOR by 1.4 points. Because their image
representations were discrete, it’s possible that their language models
were less prone to overfitting. It’s not immediately obvious, however,
that a similar improvement would occur for language models like NIC
that operate on extracted vector representations of images.

In contrast to the previous approaches that provide their RNNs with a
representation of an image only at the first timestep, Mao et al.
(2014) propose an extension of a single-layer RNN, dubbed the
“multimodal RNN,” that feeds a representation of an image to the RNN at
every word generation step. Finally, Kiros et al. (2014) propose a
model that first uses a CNN and an RNN to embed an image and its
corresponding caption in the same semantic space, and then feeds
vectors from this space into a “language generating structure content
neural language model”, an extension of a multiplicative RNN that
“disentangles the structure of a sentence to its content.”

^1 mscoco.org/dataset/

Figure 1: Word cutoff versus log-scale vocab size per image. This
metric captures both dataset size and vocabulary size and shows that
Yummly has the smallest vocabulary by a margin.

Among models that directly input extracted features to a generating
RNN, it is clear that image representations can be mishandled.
Specifically, several authors note that passing image representations
to the RNN at every timestep empirically leads to worse performance.
While Karpathy and Li (2014) do not offer speculation as to why this is
the case, Vinyals et al. (2014) briefly mention that this operation
leads to overfitting. These independent observations demonstrate that
it is easy to overfit to image features.

2.2 Caption Evaluation Metrics 

To evaluate captions, we use BLEU-{1,2,3,4} (Papineni et al., 2002),
METEOR (Denkowski and Lavie, 2014), and CIDEr/CIDEr-D (Vedantam et al.,
2014). BLEU-n is a precision measure over n-grams, whereas METEOR is a
more sophisticated metric that involves the computation of an alignment
between candidate and reference captions; both were originally
conceived in the context of machine translation. CIDEr/CIDEr-D was
created to evaluate captions of images and focuses on consensus,
particularly in cases where there are multiple reference captions.
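
To make the phrase “precision measure over n-grams” concrete, the
following is a minimal sketch of the clipped (modified) n-gram
precision that underlies BLEU-n. It is an illustration only, not the
evaluation code used in our experiments: the function names are
hypothetical, tokenization is assumed to be whitespace splitting, and
the full BLEU score additionally combines these precisions across
n-gram orders and applies a brevity penalty, both of which are omitted
here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against its references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Each candidate n-gram is credited at most as many times as it
    # appears in the single reference that contains it most often.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

For a candidate such as “thai chicken curry” scored against the
reference “thai green chicken curry”, every unigram is matched, but
only one of the two candidate bigrams appears in the reference, which
is why higher-order BLEU scores fall off quickly for short captions.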

2.3 Recipe Title Prediction Tasks 

To extend the scope of our investigation, we compile a dataset
consisting of images of food coupled with recipe titles from
Yummly.com. In this dataset, the title of a recipe is usually several
words long and can be thought of as a “summary” of the image, rather
than a direct description, as not all image content is described in the
caption. The image associated with “garlic butter shrimp,” for
instance, contains shrimp, a bowl, a lemon, and a human hand, and the
captioning algorithms must learn to pick out which items are important
to describe. Furthermore, there is less grammatical structure present
in this dataset.

We view this task as distinct from existing captioning tasks for three
reasons. First, the captions within Yummly are both short and
restricted; a caption in the Yummly setting has an average length of
4.5 words, which is very low compared to Flickr or MSCOCO settings
(both have an average of 10 words per caption), and the vocabulary is
very small (see Figure 1). Second, to address this data fully, models
must learn very fine-grained visual distinctions. Compared to the broad
ImageNet domain, the Yummly images generally consist of some food item
on a plate, coupled with several words from a small vocabulary.
Finally, this dataset contains a single caption for each image, so the
learning task is more difficult. Previous work (Hodosh et al., 2013)
has emphasized the importance of having multiple captions per image in
a caption ranking setting, though it’s unclear if similar observations
extend to a generation setting.

While the work of Malmaud et al. (2015) is the only one we are aware of
that addresses food in a multimodal fashion, Bossard et al. (2014)
compile the Food-101 dataset, which generalizes and increases the scale
of previous food image datasets (i.e. Chen et al. (2009), Yang et al.
(2010)). Their dataset includes 101k images of 101 types of foods and
the task they address is classification.

2.4 Choosing a CNN/RNN Architecture 

While substantial improvements have been made in terms of
classification accuracy on ImageNet using increasingly deep
architectures, we rely on the canonical neural network described in
Krizhevsky et al. (2012) to generate our representations in most of our
experiments. The use of AlexNet in particular allows for more direct
comparison with previous work (i.e. Bossard et al. (2014)) and faster
training time when compared to other deep models. This is beneficial
particularly because our experiments are not specifically designed to
produce state-of-the-art results.

Figure 2: Transfer learned Food-101 CNN accuracy across various classes
in the dataset, presented for easy comparison with Figure 6 in Bossard
et al. (2014). In general, this model finds the same classes difficult
to classify as the models described in previous work, suggesting that
some types of fine-grained distinctions are difficult for many models.

We perform 20 random parameter searches to determine decent parameter
settings using the Neuraltalk^2 library for all captioning experiments,
selecting parameter settings resulting in the lowest validation set
perplexity, unless specified otherwise. Settings we take as fixed
include a minimum vocabulary threshold of 5, weight optimization using
RMSprop (Tieleman and Hinton, 2012), and a hidden representation size
of 256. We restrict our consideration to NIC because we believe it to
be representative of the state-of-the-art in neural captioning. When
evaluating models, we generate captions using a beam search of width
20. For the recipe title prediction evaluation, we include an
end-of-caption token to avoid issues relating to predicted zero length
captions; this has the result of artificially inflating evaluation
metrics such that numerical cross-dataset comparisons are not valid.
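
As an illustration of the decoding step, the sketch below shows a
generic beam search over a next-token model. It is not the Neuraltalk
implementation, whose details may differ: `beam_search`, `toy_model`,
the `<END>` token string, and the probabilities in the toy table are
all hypothetical and chosen only to show how a wider beam can recover a
caption that greedy decoding misses.

```python
import math


def beam_search(next_log_probs, beam_width, max_len, end_token="<END>"):
    """Return the best (tokens, log-probability) pair found by beam search.

    next_log_probs(prefix) must return a dict mapping each candidate next
    token to its log-probability given the prefix (a tuple of tokens).
    """
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the beam_width highest-scoring extensions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == end_token:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # unfinished hypotheses if max_len was reached
    return max(finished, key=lambda item: item[1])


def toy_model(prefix):
    """Made-up next-token distribution, for illustration only."""
    table = {
        (): {"garlic": math.log(0.6), "butter": math.log(0.4)},
        ("garlic",): {"butter": math.log(0.55), "<END>": math.log(0.45)},
        ("butter",): {"shrimp": math.log(0.9), "<END>": math.log(0.1)},
        ("garlic", "butter"): {"shrimp": math.log(0.9), "<END>": math.log(0.1)},
    }
    return table.get(prefix, {"<END>": 0.0})
```

With this toy table, a beam of width 1 (greedy decoding) commits to
“garlic” first and ends with “garlic butter shrimp”, while a beam of
width 2 finds the higher-probability sequence “butter shrimp”.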

2.5 Adapting the Food CNN through Transfer Learning

To represent food images properly, we find it appropriate to learn a
model specific to the task of food recognition. Food-101 (Bossard et
al., 2014) consists of only 101K images, which is a relatively low
number of images to train a CNN from scratch. As such, we use a set of
ImageNet-trained weights as initializations for our training of a CNN
on the Food-101 classification task. This process is commonly referred
to as transfer learning (Caruana, 1995; Bengio, 2012).

The intuition behind transfer learning in CNNs is that low-level
features learned early on in the base network (which are generally
observed to be color blob and Gabor features (Yosinski et al., 2014))
are useful to networks trained on diverse classification tasks.
Initializing the weights of the network to weights successful in
another classification task should allow training of the new network to
converge faster and to a better local optimum than if random
initializations were used.

^2 github.com/karpathy/neuraltalk

In fact, for the Food-101 dataset, we achieve a rank-1 accuracy of
66.80% when using transfer learning, compared with the 56.40% rank-1
accuracy reported by Bossard et al. (2014) using the same AlexNet
architecture; class-by-class accuracies are given in Figure 2 for
comparison with previous work. Our network is learned using only 100k
training iterations in the Caffe library at a reduced learning rate,
whereas training from scratch required Bossard et al. 450k iterations.
For our tuning process, we follow the guidelines and parameter settings
specified by the transfer learning example distributed with Caffe.^3

Once the network is tuned, we compute 4096-dimensional vector
representations for each image in the Yummly dataset by extracting the
network activations in the final fully-connected layer.

3 Yummly Dataset: Description and Baselines

After establishing that a CNN could be transfer 
learned to classify images of dishes at state-of-the- 
art performance, we were able to shift our focus to 
caption generation in a food domain. 

The food dataset we collect contains roughly 66K recipes, each
consisting of a single image-recipe pair. This data was taken from
Yummly.com, a website that aggregates and performs analysis of millions
of recipes. Out of the 66K recipes, 6K are reserved for testing, 6K are
designated as a validation set, and the remaining 54K are used for
model training.

^3 https://github.com/BVLC/caffe

Figure 3: Examples of the captioning system output on several images.
The first row of images represents images that are well captioned. The
second row represents different types of images the system believes to
be sandwiches. The third row represents images that the system has
captioned incorrectly. Output captions shown: deviled eggs (logprob:
-1.52); lasagna (logprob: -4.02); pulled pork sandwiches (logprob:
-3.43); french dip sandwiches (logprob: -5.78); spaghetti carbonara
(logprob: -3.81); egg salad sandwiches (logprob: -6.05).

This dataset differs from the Flickr datasets and MSCOCO both in terms
of vocabulary and in terms of image content. The vocabulary size per
image is smaller than any of the other datasets by a wide margin (see
Figure 1). While it’s clear the vision task requires more subtle
distinction when compared to ImageNet, because the average caption
length is shorter, it’s ambiguous as to whether or not the Yummly
language generation task is particularly “fine-grained.”

3.1 Baseline Results 

Table 1 presents some baseline results using the algorithms listed.
Common-3 predicts a reasonable ordering of the three most common words
(“with chicken and”) for all captions. Nearest neighbor predicts the
caption of the nearest neighbor in the transfer-learned
4096-dimensional embedding space. Common-Tri/Bi predict the most common
tri/bigram in our dataset (“macaroni and cheese”/“ice cream”) for all
images.
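
The two kinds of baseline can be sketched in a few lines. This is a
hedged illustration rather than our experimental code: the function
names are hypothetical, captions are assumed to be whitespace-tokenized
strings, and Euclidean distance is an assumption, since the distance
used for the nearest neighbor baseline is not stated above.

```python
from collections import Counter
import math


def most_common_ngram(captions, n):
    """Return the most frequent n-gram (as a string) over all captions."""
    counts = Counter()
    for caption in captions:
        tokens = caption.split()
        counts.update(" ".join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(1)[0][0]


def nearest_neighbor_caption(query, train_features, train_captions):
    """Return the caption of the training image closest to the query.

    Euclidean distance is an assumption made for this sketch.
    """
    best = min(range(len(train_features)),
               key=lambda i: math.dist(query, train_features[i]))
    return train_captions[best]
```

The Common-Tri/Bi baselines emit the output of `most_common_ngram` for
every test image regardless of its content, which is what makes them a
useful lower bound on how much of a score is attributable to language
statistics alone.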


Across the board, and particularly for BLEU-{2,3,4} scores, the caption
generating programs outperform all baselines, which suggests the
proposed task is adequately framed. However, it is worth noting that
only roughly 300/6117 (roughly 5%) of generated captions are unique.
This is rather low when compared with a representative result for
Flickr8k, a dataset of similar size, where 200/1000 (roughly 20%) of
generated captions are unique. It might be possible to re-frame the
Yummly generation task as one of classification; however, it’s not
obvious how one might derive a fixed set of labels. In a later section
we discuss whether having only one caption per image, or some other
dataset feature, is a contributing factor to this result.
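
The uniqueness figures quoted above are simple ratios of distinct
outputs to total outputs. A minimal sketch follows, assuming that
“unique” means distinct caption strings among the generated captions;
the function name is hypothetical.

```python
def unique_caption_fraction(captions):
    """Fraction of generated captions that are distinct strings."""
    return len(set(captions)) / len(captions)


# The ratios quoted in the text, as fractions.
yummly_fraction = 300 / 6117    # roughly 0.05
flickr8k_fraction = 200 / 1000  # 0.20
```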
Figure 4: Classification accuracy of CNN versus seven different normalized (100 is best possible)
linguistic criteria for both the transfer learned (left) and directly learned (right) domains.



Method                    B-1    B-2    B-3    B-4
Com-3                    14.2    2.7    0.8    0.0
N-Neigh                  20.5    2.5    0.6    0.0
Com-Tri                  30.4    6.5    3.4    2.2
Com-Bi                   35.4    8.9    5.2    0.0
Karpathy and Li (2014)   42.7   19.6   11.9   13.2
Vinyals et al. (2014)    46.2   23.1   14.8   10.2

Table 1: Yummly BLEU-{1,2,3,4} scores for several baselines and two
high performing language generation algorithms.


4 Image Representations 

4.1 Experiment Descriptions 

We vary image representation quality as follows: for the Flickr8k and
Flickr30k datasets, we compute the representations given by snapshots
of AlexNet taken mid-training on the ILSVRC2012 (Russakovsky et al.,
2015) task. We use snapshots taken at intervals of 10k from 0k (random
initialization) to 100k iterations. While this range of iterations is
before the model has entirely converged, the rank-1 classification
accuracy of the trained CNN over the ImageNet validation set increases
from roughly 0% to over 40% during this time (after the model converges
at 450k iterations, the rank-1 validation accuracy is 57%). From the
standpoint of examining representation quality, this set of snapshots
is important because this is likely where the network is learning most
of its layer-by-layer abstractions, and the behavior of the network
after 100k iterations can be extrapolated based on the data we analyze
here.

In a similar fashion, for Yummly we compute representations generated
by snapshots of the transfer learned network at intervals of 10k from
0k to 90k, though our starting point is a fully-converged CNN that
produces 57% rank-1 accuracy on ImageNet’s validation set.

We train 5 NIC models from a random initialization per CNN for Flickr8k
and Yummly, and 2-4 NIC models per CNN for Flickr30k. Every data point
described in the following section is the result of up to six days of
parallel computation using a modern 4/8-core machine. It should be
noted that test/validation accuracy of these CNNs is not monotonically
increasing with snapshot number. While the trend is that training CNNs
for more iterations results in higher accuracy, there is some noise.
For instance, for the Food-101 transfer learned CNN, rank-1 test
accuracy drops from 61% to 60% over the snapshots extracted at 10k and
20k iterations respectively, before abruptly jumping to 66% testing
accuracy in the next 10k iterations.

4.2 Results 

We evaluate predicted captions using seven caption evaluation metrics,
namely, BLEU-{1,2,3,4}, METEOR, and CIDEr/CIDEr-D. Figure 4 shows our
main results for both the directly learned and transfer learned
domains. In both cases, all captioning metrics appear to level off
early, and do not improve significantly with increased classification
rate after a point. This suggests that weight settings for a fixed CNN
with higher classification rates are unlikely to produce significantly
better captions in terms of these seven evaluation metrics, after a
point.

To quantify this lack of improvement, for each dataset we select a CNN
that performs its associated visual classification task relatively
poorly, and compare it to all better-classifying CNNs. For Flickr8k,
for instance, we consider a CNN that produces 30.5% rank-1 accuracy on
ImageNet’s validation set, and compare its caption performance against
that of 8 “better” CNNs that achieve between 34.6% and 41.7% accuracy;
there are a total of 56 comparisons (8 CNNs by 7 metrics), in this
case.

Though it is difficult to compute accurate statistics with only 5
observations in each group, we conduct three separate statistical
tests, each with different variance/normality assumptions/efficiencies.
The tests we perform are Student’s t-test, the Mann-Whitney U-test, and
Welch’s unpaired t-test.
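
To show how the three tests differ in their assumptions, the sketch
below computes each test statistic from two small groups of scores
using only the Python standard library. It is an illustration, not the
analysis code used here: the function names are hypothetical, and the
step from statistic to p-value is omitted (in practice a statistics
package such as SciPy, via `scipy.stats.ttest_ind` and
`scipy.stats.mannwhitneyu`, would supply it).

```python
from statistics import mean, variance
import math


def student_t(a, b):
    """Student's two-sample t statistic (pools the two variances)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def welch_t(a, b):
    """Welch's t statistic (does not assume equal variances)."""
    return (mean(a) - mean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b))


def mann_whitney_u(a, b):
    """Mann-Whitney U statistic for sample a (rank-based, ties count 0.5)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)
```

When both groups have the same number of observations, as with 5 NIC
models per CNN, the Student and Welch statistics coincide and the two
tests differ only in their degrees of freedom; the Mann-Whitney test
uses only the ordering of the scores and so makes no normality
assumption.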

In the case of Flickr8k, there are very few significant differences
between the 30.5%-CNN and more accurate CNNs. In fact, in 14/56 cases
(including half the time among BLEU-1/2 scores) the lower classifying
CNN actually produced better captions. The results significant at the
5% level for any statistical test suggested that the 38%-CNN
outperformed the 30.5%-CNN in terms of BLEU-1/2, and that the 39.5%-CNN
outperformed the 30.5%-CNN in terms of METEOR.

The results for Flickr30k were very similar to the results for
Flickr8k. In Figure 5 we plot results from this dataset against CNN
iteration number rather than CNN classification accuracy. We modify the
presentation of our data simply to demonstrate that caption quality and
iteration number (not just testing/validation accuracy) are also
apparently independent after a point. No evidence of improvement was
observed after the 30.5%-CNN, though only 2-4 observations per CNN
could be made due to computational restrictions.

In total, in the directly-learned domain (Flickr8k/30k) all metrics
appear to saturate after AlexNet reaches 30% classification accuracy
over the ImageNet validation set. It is possible that training to
convergence could result in slightly higher quality captions. However,
our results indicate that efforts on ImageNet which result in less than
a roughly 10% rank-1 classification accuracy increase for a fixed
network are likely not worth undertaking if one’s end goal is higher
quality captions.

Figure 5: Caption quality versus CNN iteration (in thousands of iters)
that representations were derived from. It is clear that a caption
quality saturation happens very early on, and there is little to no
improvement in captions as the CNNs are trained for more time.

In the transfer learned domain, it is clear that domain adaptation
improves caption quality, even after a small number of iterations. All
statistical tests for all evaluation metrics indicate a highly
significant difference (p < .01) between captions generated by a CNN
trained directly on ImageNet, and one that has been transfer-learned
using Food-101 for just 10K iterations (producing a rank-1 testing
accuracy of 61.2% on that dataset). After a point, however, we observe
the same independence of caption quality and classification accuracy.

It seems that “knowing more” about the image does not help the RNN
generate more accurate captions after a point because the language
patterns it learns are sufficient. This result is akin to prior work
(e.g. Sutskever et al. (2011)) which demonstrates that RNNs are able to
generate reasonable natural language, given a relatively weak seeding
signal. The “weak” signal in this case is provided by image
representations, rather than by a short sequence of starting words.

4.3 The Effect of Changing CNN Architectures

Our analysis thus far has focused on a single image model, AlexNet, for
extracting image representations. In this experiment, we compare the
captions generated on Flickr8k when using an improved CNN. We train 15
NIC models based on features extracted from a fully converged AlexNet,
and 15 NIC models based on features extracted from a fully converged
16-layer VGGNet (Simonyan and Zisserman, 2014). The former model
produces a 57.1% rank-1 accuracy over ImageNet’s validation set, while
the latter outperforms this mark, producing 75.6% rank-1 validation
accuracy. The default train/validation/test split of 6k/1k/1k images is
used.

Our results are summarized in Table 2. In addition to the seven caption
evaluation metrics we’ve used in previous experiments, this table also
includes the proportion of the 1k generated captions that are unique,
and the train/validation perplexities.

Counter-intuitively, we find that, despite producing 18.5 percentage
points lower rank-1 accuracy on ImageNet’s validation set, AlexNet
generates better captions than VGG net by all evaluation metrics.
Notably, the models using VGG features produce lower perplexity across
the validation split. Because we used validation perplexity as a metric
for hyperparameter selection, it’s likely that the VGG net models are
overfitting to the particular Flickr8k validation split we used.
However, the AlexNet trained models do not suffer a similar performance
degradation. Here, it appears that not overfitting to image features is
more important than taking advantage of very detailed image
representations.

Our results from this experiment illustrate that better image
representations might actually cause models like NIC to become more
prone to overfitting. It’s possible, too, that the early saturation of
caption quality observed in the previous sections could be primarily
due to overfitting. Future work would be well suited to evaluate
different methods of hyperparameter selection.

4.4 One caption per image? 

We conclude with a final experiment to address one potential
shortcoming of domains similar to Yummly, where one is only able to
extract a single caption per image. Though Yummly differs from the
other datasets we explore in several ways (caption length/vocab size),
a fundamental question arises from its examination: for a fixed amount
of training data, is it better to have more captions per image, or more
images with single captions? In short, we hope to experimentally
examine Hodosh et al.’s (2013) suggestion that having multiple captions
per image is vital.

To address this question, we use Flickr30k, which provides five
captions per image. We subset this dataset in two ways. In the first,
we remove 4 captions randomly from each image in the training set, but
keep all images (the “more images” method). In the second, we randomly
remove 80% of training images, but keep all 5 captions for the
remaining (the “more captions” method). This subsetting scheme is such
that the overall number of image/caption pairs is the same between both
methods, but the training data is of a different form.
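
The two subsetting methods can be sketched as follows. This is a
minimal illustration under stated assumptions, not our preprocessing
code: the dataset is assumed to be a dictionary mapping an image
identifier to its list of five captions, and the function names and the
`keep_fraction` parameter are hypothetical.

```python
import random


def more_images_subset(dataset, rng):
    """Keep every image but only one randomly chosen caption per image."""
    return {image: [rng.choice(captions)]
            for image, captions in dataset.items()}


def more_captions_subset(dataset, rng, keep_fraction=0.2):
    """Keep a random 20% of the images together with all their captions."""
    images = sorted(dataset)
    kept = rng.sample(images, int(len(images) * keep_fraction))
    return {image: list(dataset[image]) for image in kept}
```

Passing a seeded `random.Random` instance makes each random subset
reproducible, and both functions return the same number of
image/caption pairs whenever every image has five captions and the
number of images is a multiple of five.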

We extract image representations from the ImageNet CNN at 100k
iterations (which produces roughly 40% rank-1 classification accuracy
over the ImageNet validation set) and train NIC on 6 random datasets
constructed via the “more images” subsetting method, and 7 random
datasets constructed via the “more captions” subsetting method.
Finally, we generate captions and compare performance. A good
hyperparameter setting for Flickr30k is borrowed from the random search
conducted over the whole dataset experiments described in the previous
section.

Our findings, summarized in Table 3, generally align with the accepted
notion that having more captions and fewer images is better than having
more images with single captions. For all seven evaluation metrics, the
mean score for the models trained on the “more captions” datasets was
greater than the mean score for the models trained on the “more images”
datasets, and the results were significant at the 5% level for all
three statistical tests in the case of BLEU-1 and BLEU-2.
Interestingly, for CIDEr/CIDEr-D, the results were somewhat significant
(all 6 p-values less than .15) but the results for METEOR were the
least significant (all 3 p-values greater than .94).

The validation perplexity of the “more images” method is lower when
compared to the “more captions” method, whereas the training perplexity
is higher. Despite the fact that the output captions are better
overall, this is an indication that having multiple captions per image
can actually make NIC more prone to overfitting.

Finally, the NIC models trained on the “more captions” subsets produced
higher proportions of unique captions on the test set. This suggests
that the single-caption per image feature of the Yummly dataset
contributed to a lack of caption innovation.

                            AlexNet       VGG
Top-1 ImageNet Val Acc        57.1%     75.6%
B-1                          54.187    53.913
B-2                          33.967    33.527
B-3**                        20.640    20.007
B-4                          12.833    12.213
METEOR                       14.559    14.559
CIDEr                        32.416    31.362
CIDEr-D*                     26.200    25.242
Proportion Unique***          20.5%     17.0%
Training Perplexity***        10.79     11.04
Validation Perplexity***      17.84     17.66

Table 2: Effect on caption quality when using the fully converged
AlexNet and VGGNet on Flickr8k. Significance for all 3 statistical
tests that there was a true difference between the two architectures:
*** < .001, ** < .01, * < .05

Despite only having one caption per image, however, NIC was still able
to produce good results on the single-captioned subsets. This indicates
that quality captioning datasets can be built with only one caption per
image. The number of additional images one needs to gather to
compensate for this feature, however, is likely greater than the number
of captions one would need to add to existing images.

5 Conclusion 

We demonstrate the relationship between CNN classification accuracy and
the quality of captions generated by a state-of-the-art neural
captioning algorithm. Training increasingly accurate image classifiers
does not lead to better captions, after a point. This early saturation
of caption quality is an indication that the performance of neural
caption generating algorithms likely cannot be increased directly by
producing more accurate CNNs. Furthermore, many of the apparently
highly-specific generated captions output by models like NIC are likely
due to language models capturing coarse-grained information and
generating corresponding plausible natural language sequences.

The role of overfitting to image features is difficult to quantify. On
one hand, there is extra information contained in image representations
that NIC, for instance, does not take advantage of, and even commonly
overfits to. However, it’s not clear that this extra, fine-grained
information is even worth taking into account. The success of models
that generate language based on discretized image representations (e.g.
Young et al. (2014)) demonstrates that algorithms are capable of
state-of-the-art performance without consideration of rich, real-valued
vector features. It’s likely that these types of models are less prone
to overfitting, as well.

                          More Captions   More Images
B-1**                            55.167        54.243
B-2*                             33.567        32.814
B-3                              20.633        20.300
B-4                              13.133        13.014
METEOR                           13.105        13.096
CIDEr                            21.428        20.418
CIDEr-D                          16.350        15.550
Proportion Unique**               14.8%         9.96%
Training Perplexity**             14.69         16.01
Validation Perplexity*            25.86         25.33

Table 3: Evaluations for the NIC models trained on subsets of Flickr30k
containing more captions (5 captions per image, 1/5 the total number of
images) and more images (1 caption per image, all training images).
Significance for all 3 statistical tests that there was a true
difference between the subsetting techniques: ** < .01, * < .05